Tagging Spoken Language Using Written Language Statistics

Authors

  • Joakim Nivre
  • Leif Grönqvist
  • Malin Gustafsson
  • Torbjörn Lager
  • Sylvana Sofkova Hashemi
Abstract

This paper reports on two experiments in which a probabilistic part-of-speech tagger, trained on a tagged corpus of written Swedish, is used to tag a corpus of (transcribed) spoken Swedish. The results indicate that with very little adaptation an accuracy rate of 85% can be achieved, with an accuracy rate for known words of 90%. In addition, two different treatments of pauses were explored, but with no significant gain in accuracy under either condition.

1 Introduction

What happens when we take a probabilistic part-of-speech tagger trained on written language and try to use it on spoken language transcriptions? The answer to this question is interesting from several points of view, some more practical and some more theoretically oriented. From a practical point of view, it is interesting to know how well a written language tagger can perform on spoken language, because it may save us a lot of work if we can reuse existing taggers instead of developing new ones for spoken language. From a more theoretical point of view, the results of such an experiment may tell us something about the ways in which the structure of spoken language is different (or not so different) from that of written language.

In this paper, we report on experimental work dealing with the part-of-speech tagging of a corpus of (transcribed) spoken Swedish. The tagger used implements a standard probabilistic biclass model (see, e.g., (DeRose 1988)) trained on a tagged subset of the Stockholm-Umeå Corpus of written Swedish (Ejerhed et al 1992). Given that the transcriptions contain many modifications of standard orthography (in order to capture spoken language variants, reductions, etc.), a special lexicon had to be developed to map spoken language variants onto their canonical written language forms. In addition, a special tokenizer had to be developed to handle "meta-symbols" in the transcriptions, such as markers for pauses, overlapping speech, inaudible speech, etc.
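The two preprocessing steps mentioned above (mapping spoken variants onto written forms, and handling meta-symbols) might look roughly like the following minimal sketch. Everything in it is an assumption made for illustration: the lexicon entries are common colloquial Swedish spellings, not entries from the authors' lexicon; the pause marker `//` and the angle-bracket convention for other meta-symbols are hypothetical; and the `keep_pauses` flag only shows how pause handling could be toggled, not the two treatments actually compared in the paper.

```python
import re

# Hypothetical lexicon: spoken-language spelling -> canonical written form.
# Illustrative entries only, not taken from the paper.
SPOKEN_TO_WRITTEN = {"nåt": "något", "sen": "sedan", "mej": "mig"}

# Hypothetical transcription conventions (assumptions, not the corpus standard).
PAUSE = "//"                         # pause marker
META = re.compile(r"^<[^>]+>$")      # other meta-symbols, e.g. <inaudible>


def prepare(utterance, keep_pauses=False):
    """Tokenize a transcribed utterance and normalize it for a written-language tagger."""
    out = []
    for tok in utterance.split():
        if tok == PAUSE:
            # Either drop the pause or pass it on as a special token.
            if keep_pauses:
                out.append("<PAUSE>")
            continue
        if META.match(tok):
            # Drop all other meta-symbols before tagging.
            continue
        word = tok.lower()
        # Map a spoken variant to its written form; unknown words pass through.
        out.append(SPOKEN_TO_WRITTEN.get(word, word))
    return out


print(prepare("ge mej nåt // sen <inaudible>"))
print(prepare("ge mej nåt // sen <inaudible>", keep_pauses=True))
```

A real lexicon would also have to deal with spoken forms that correspond to more than one written form, which a plain dictionary lookup like this one cannot express.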
One of the interesting issues in this context is what use (if any) should be made of information about pauses, interruptions, etc. In the experiment reported here, we compare two different treatments of pauses and evaluate the performance of the tagger under these two different conditions.

2 Background

2.1 Probabilistic Part-of-speech Tagging

The problem of (automatically) assigning parts of speech to words in context has received a lot of attention within computational corpus linguistics. A variety of different methods have been investigated, most of which fall into two broad classes:

  • Probabilistic methods, e.g. (DeRose 1988; Cutting et al 1992; Merialdo 1994).
  • Rule-based methods, e.g. (Brodda 1982; Karlsson 1990; Koskenniemi 1990; Brill 1992).

Probabilistic taggers have typically been implemented as hidden Markov models, using probabilistic models with two kinds of basic probabilities:

  • The lexical probability of seeing the word $w$ given the part-of-speech $t$: $P(w \mid t)$.
  • The contextual probability of seeing the part-of-speech $t_i$ given the context of $n-1$ parts-of-speech: $P(t_i \mid t_{i-(n-1)}, \ldots, t_{i-1})$.

Models of this kind are usually referred to as n-class models, the most common instances of which are the biclass ($n = 2$) and triclass ($n = 3$) models. The lexical and contextual probabilities of an n-class tagger are usually estimated using one of two methods:

Footnote 1: The terms 'RF training' and 'ML training' are taken from Merialdo 1994. It should be pointed out, though, that the use of relative frequencies to estimate occurrence probabilities is also a case of maximum likelihood estimation (MLE).
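As a rough illustration of the biclass case ($n = 2$), the sketch below estimates the lexical probabilities $P(w \mid t)$ and contextual probabilities $P(t_i \mid t_{i-1})$ by relative frequency from a tagged corpus and then finds the most probable tag sequence with the Viterbi algorithm. It is not the authors' implementation. The start symbol `<s>`, the constant `floor` used for unseen events, and the toy English corpus are all assumptions introduced here; a practical tagger would need a proper treatment of unknown words and smoothing.

```python
import math
from collections import Counter


def train_rf(tagged_sentences):
    """Relative-frequency estimates of P(w | t) and P(t_i | t_{i-1})."""
    tag_count, prev_count = Counter(), Counter()
    emit, trans = Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"  # assumed sentence-start symbol
        for word, tag in sent:
            tag_count[tag] += 1
            emit[(tag, word)] += 1
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            prev = tag
    lexical = {k: c / tag_count[k[0]] for k, c in emit.items()}
    contextual = {k: c / prev_count[k[0]] for k, c in trans.items()}
    return lexical, contextual, sorted(tag_count)


def viterbi(words, lexical, contextual, tags, floor=1e-6):
    """Most probable tag sequence under the biclass model.

    Unseen (tag, word) and (tag, tag) pairs get the constant `floor`,
    a crude stand-in for real smoothing (an assumption of this sketch).
    """
    if not words:
        return []

    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[t] = (log probability, tag path) of the best path ending in tag t
    best = {t: (lp(contextual, ("<s>", t)) + lp(lexical, (t, words[0])), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            score, path = max(
                ((best[p][0] + lp(contextual, (p, t)), best[p][1]) for p in tags),
                key=lambda x: x[0],
            )
            new[t] = (score + lp(lexical, (t, w)), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]


# Toy corpus, purely illustrative.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
]
lexical, contextual, tags = train_rf(corpus)
print(viterbi(["the", "cat", "barks"], lexical, contextual, tags))
```

Working in log space avoids numerical underflow on longer sentences, and a triclass model would condition each transition on the two preceding tags instead of one.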


Similar resources

Using PiTagger for Lemmatization and PoS Tagging of a Spontaneous Speech Corpus: C-Oral-Rom Italian

The automatic lemmatization and morpho-syntactic annotation of spoken language is a quite recent and complex task for Natural Language Processing. The state of the art on written corpora does not provide us with a satisfactory level of analysis regarding spontaneous spoken language (Uchimoto et al., 2002; Moreno & Guirao, 2003). The spontaneous speech corpus Italian C-ORALROM has been tagged with ...


Tagging spoken corpus

Spoken languages are more flexible in usage than written languages. Thus, tagging spoken corpus differs from tagging traditional written corpus. This paper proposes a new framework for tagging spoken corpus. The framework adopts the written tagger to process spoken data with the special consideration of the characteristics of spoken language. Besides, the problems of different tagging sets betw...


Tagging a Corpus of Spoken Swedish

In this article, we present and evaluate a method for training a statistical part-of-speech tagger on data from written language and then adapting it to the requirements of tagging a corpus of transcribed spoken language, in our case spoken Swedish. This is currently a significant problem for many research groups working with spoken language, since the availability of tagged training data from s...


STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data

Part-of-speech tagging (POS-tagging) of spoken data requires different means of annotation than POS-tagging of written and edited texts. In order to capture the features of German spoken language, a distinct tagset is needed to respond to the kinds of elements which only occur in speech. In order to create such a coherent tagset the most prominent phenomena of spoken language need to be analyze...


The 8th Linguistic Annotation Workshop in conjunction with COLING 2014

Part-of-speech tagging (POS-tagging) of spoken data requires different means of annotation than POS-tagging of written and edited texts. In order to capture the features of German spoken language, a distinct tagset is needed to respond to the kinds of elements which only occur in speech. In order to create such a coherent tagset the most prominent phenomena of spoken language need to be analyze...


Adult’s Learning Strategies for Receptive Skill Self-managing or Teacher-managing

Receptive language skill refers to answering appropriately to another person's spoken language. A lot of teachers try to develop receptive language skills in their language learners. When receptive language skills are not appropriately acquired, learners may miss significant learning opportunities resulting in delays in the development and acquisition of spoken language. The goals of this paper...



Publication date: 1996